On the Topic of Jets: Disentangling Quarks and Gluons at Colliders
We introduce jet topics: a framework to identify underlying classes of jets
from collider data. Because of a close mathematical relationship between
distributions of observables in jets and emergent themes in sets of documents,
we can apply recent techniques in "topic modeling" to extract jet topics from
data with minimal or no input from simulation or theory. As a proof of concept
with parton shower samples, we apply jet topics to determine separate quark and
gluon jet distributions for constituent multiplicity. We also determine
separate quark and gluon rapidity spectra from a mixed Z-plus-jet sample. While
jet topics are defined directly from hadron-level multi-differential cross
sections, one can also predict jet topics from first-principles theoretical
calculations, with potential implications for how to define quark and gluon
jets beyond leading-logarithmic accuracy. These investigations suggest that jet
topics will be useful for extracting underlying jet distributions and fractions
in a wide range of contexts at the Large Hadron Collider.
Comment: 8 pages, 4 figures, 1 table. v2: Improved discussion to match PRL version
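The demixing idea behind the abstract above can be illustrated with a minimal sketch. It assumes two binned, normalized distributions that are mixtures of the same two underlying classes, and it assumes each class has at least one bin where the other vanishes (mutual irreducibility). The function names and the toy histograms are hypothetical and chosen only for illustration; they are not taken from the paper, and real data would need a statistically careful estimate of the minimum ratio.

```python
import numpy as np

def reducibility(p_num, p_den):
    """kappa = minimum over bins of p_num / p_den, using bins where p_den > 0."""
    mask = p_den > 0
    return float(np.min(p_num[mask] / p_den[mask]))

def demix(p_a, p_b):
    """Extract two topics and the topic-1 fractions from two mixed histograms.

    Fails (division by zero) if the two samples are identical, since then
    both reducibility factors equal 1.
    """
    k_ab = reducibility(p_a, p_b)
    k_ba = reducibility(p_b, p_a)
    # Subtract as much of the other sample as possible without going negative.
    topic_1 = (p_a - k_ab * p_b) / (1.0 - k_ab)
    topic_2 = (p_b - k_ba * p_a) / (1.0 - k_ba)
    # Fraction of topic 1 in each sample, from the two reducibility factors.
    frac_a = (1.0 - k_ab) / (1.0 - k_ab * k_ba)
    frac_b = k_ba * frac_a
    return topic_1, topic_2, frac_a, frac_b

# Toy, made-up class distributions (not physics results).
quark = np.array([0.5, 0.3, 0.2, 0.0])
gluon = np.array([0.0, 0.2, 0.3, 0.5])
f_a, f_b = 0.8, 0.3  # assumed quark fractions of the two mixed samples

sample_a = f_a * quark + (1 - f_a) * gluon
sample_b = f_b * quark + (1 - f_b) * gluon

topic_1, topic_2, frac_a, frac_b = demix(sample_a, sample_b)
```

In this toy setup the extracted topics coincide with the input class distributions, and the fractions come out without being supplied to `demix`. With finite statistics the minimum ratio is sensitive to sparsely populated bins, so this sketch should not be applied to real histograms as is.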
Classification without labels: Learning from mixed samples in high energy physics
Modern machine learning techniques can be used to construct powerful models
for difficult collider physics problems. In many applications, however, these
models are trained on imperfect simulations due to a lack of truth-level
information in the data, which risks the model learning artifacts of the
simulation. In this paper, we introduce the paradigm of classification without
labels (CWoLa) in which a classifier is trained to distinguish statistical
mixtures of classes, which are common in collider physics. Crucially, neither
individual labels nor class proportions are required, yet we prove that the
optimal classifier in the CWoLa paradigm is also the optimal classifier in the
traditional fully-supervised case where all label information is available.
After demonstrating the power of this method in an analytical toy example, we
consider a realistic benchmark for collider physics: distinguishing quark-
versus gluon-initiated jets using mixed quark/gluon training samples. More
generally, CWoLa can be applied to any classification problem where labels or
class proportions are unknown or simulations are unreliable, but statistical
mixtures of the classes are available.
Comment: 18 pages, 5 figures; v2: intro extended and references added; v3:
additional discussion to match JHEP version
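The optimality claim in the abstract above can be checked numerically in a toy setting. The sketch below assumes one-dimensional Gaussian signal and background densities and two mixed samples with signal fractions f1 > f2; all names and numbers are hypothetical. It verifies that the likelihood ratio between the mixed samples orders points in exactly the same way as the likelihood ratio between the pure classes, so thresholding either one gives the same decision regions.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma=1.0):
    """Normal density, used here only as a toy class distribution."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def pure_ratio(x):
    """Likelihood ratio p_S(x) / p_B(x) for the assumed toy Gaussians."""
    return gaussian_pdf(x, 1.0) / gaussian_pdf(x, -1.0)

def mixed_ratio(x, f1, f2):
    """Likelihood ratio between mixtures with signal fractions f1 and f2."""
    ratio = pure_ratio(x)
    return (f1 * ratio + (1 - f1)) / (f2 * ratio + (1 - f2))

x = np.linspace(-3.0, 3.0, 201)
pure = pure_ratio(x)
mixed = mixed_ratio(x, f1=0.7, f2=0.2)

# The mixed ratio is a monotonically increasing function of the pure ratio
# whenever f1 > f2, so both rank the points identically.
same_ordering = np.array_equal(np.argsort(pure), np.argsort(mixed))
```

Neither `f1` nor `f2` is needed to rank events; they appear here only to construct the mixtures. In practice the mixed-sample ratio would be approximated by a trained classifier rather than computed from known densities.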
Energy flow polynomials: A complete linear basis for jet substructure
We introduce the energy flow polynomials: a complete set of jet substructure
observables which form a discrete linear basis for all infrared- and
collinear-safe observables. Energy flow polynomials are multiparticle energy
correlators with specific angular structures that are a direct consequence of
infrared and collinear safety. We establish a powerful graph-theoretic
representation of the energy flow polynomials which allows us to design
efficient algorithms for their computation. Many common jet observables are
exact linear combinations of energy flow polynomials, and we demonstrate the
linear spanning nature of the energy flow basis by performing regression for
several common jet observables. Using linear classification with energy flow
polynomials, we achieve excellent performance on three representative jet
tagging problems: quark/gluon discrimination, boosted W tagging, and boosted
top tagging. The energy flow basis provides a systematic framework for complete
investigations of jet substructure using linear methods.
Comment: 41+15 pages, 13 figures, 5 tables; v2: updated to match JHEP version
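As an illustration of the graph-based energy correlators described above, the following is a brute-force sketch, not the efficient algorithm the abstract mentions. It assumes each particle carries an energy fraction `z[i]`, that `theta[i, j]` is a symmetric pairwise angular distance with zero diagonal, and that a graph is given as a list of vertex-index pairs. The function name `efp` and the toy two-particle jet are hypothetical.

```python
import itertools
import numpy as np

def efp(z, theta, edges, beta=1.0):
    """Sum over all vertex-to-particle assignments of
    (product of energy fractions) * (product of theta**beta over edges).

    Cost grows as len(z) ** n_vertices, so this is only for tiny examples.
    """
    n_vertices = 1 + max(v for edge in edges for v in edge)
    total = 0.0
    for assignment in itertools.product(range(len(z)), repeat=n_vertices):
        energy_weight = np.prod([z[i] for i in assignment])
        angular_weight = np.prod(
            [theta[assignment[k], assignment[l]] ** beta for k, l in edges]
        )
        total += energy_weight * angular_weight
    return float(total)

# Toy jet: two particles sharing the energy equally, separated by unit angle.
z = np.array([0.5, 0.5])
theta = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

line = efp(z, theta, edges=[(0, 1)])                      # single edge
wedge = efp(z, theta, edges=[(0, 1), (1, 2)])             # two edges sharing a vertex
triangle = efp(z, theta, edges=[(0, 1), (1, 2), (0, 2)])  # closed triangle
```

The triangle graph vanishes for this jet because any assignment of three vertices to two particles repeats a particle across some edge, where the angular distance is zero.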
An operational definition of quark and gluon jets
While "quark" and "gluon" jets are often treated as separate, well-defined
objects in both theoretical and experimental contexts, no precise, practical,
and hadron-level definition of jet flavor presently exists. To remedy this
issue, we develop and advocate for a data-driven, operational definition of
quark and gluon jets that is readily applicable at colliders. Rather than
specifying a per-jet flavor label, we aggregately define quark and gluon jets
at the distribution level in terms of measured hadronic cross sections.
Intuitively, quark and gluon jets emerge as the two maximally separable
categories within two jet samples in data. Benefiting from recent work on
data-driven classifiers and topic modeling for jets, we show that the practical
tools needed to implement our definition already exist for experimental
applications. As an informative example, we demonstrate the power of our
operational definition using Z+jet and dijet samples, illustrating that pure
quark and gluon distributions and fractions can be successfully extracted in a
fully well-defined manner.
Comment: 38 pages, 10 figures, 1 table; v2: updated to match JHEP version
A Theory of Quark vs. Gluon Discrimination
Understanding jets initiated by quarks and gluons is of fundamental
importance in collider physics. Efficient and robust techniques for quark
versus gluon jet discrimination have consequences for new physics searches,
precision studies, parton distribution function extractions, and
many other applications. Numerous machine learning analyses have attacked the
problem, demonstrating that good performance can be obtained but generally not
providing an understanding for what properties of the jets are responsible for
that separation power. In this paper, we provide an extensive and detailed
analysis of quark versus gluon discrimination from first-principles theoretical
calculations. Working in the strongly-ordered soft and collinear limits, we
calculate probability distributions for fixed $N$-body kinematics within jets
with up through three resolved emissions ($\mathcal{O}(\alpha_s^3)$). This enables
explicit calculation of quantities central to machine learning such as the
likelihood ratio, the area under the receiver operating characteristic curve,
and reducibility factors within a well-defined approximation scheme. Further,
we relate the existence of a consistent power counting procedure for
discrimination to ideas for operational flavor definitions, and we use this
relationship to construct a power counting for quark versus gluon
discrimination as an expansion in $e^{C_F - C_A} \ll 1$, the exponential of the
fundamental and adjoint Casimirs. Our calculations provide insight into the
discrimination performance of particle multiplicity and show how observables
sensitive to all emissions in a jet are optimal. We compare our predictions to
the performance of individual observables and neural networks with parton
shower event generators, validating that our predictions describe the features
identified by machine learning.
Comment: 56 pages, 17 figures; v2: corrected calculations, conclusions remain
unchanged; v3: updated to match JHEP version
Pileup Mitigation with Machine Learning (PUMML)
Pileup involves the contamination of the energy distribution arising from the
primary collision of interest (leading vertex) by radiation from soft
collisions (pileup). We develop a new technique for removing this contamination
using machine learning and convolutional neural networks. The network takes as
input the energy distribution of charged leading vertex particles, charged
pileup particles, and all neutral particles and outputs the energy distribution
of particles coming from leading vertex alone. The PUMML algorithm performs
remarkably well at eliminating pileup distortion on a wide range of simple and
complex jet observables. We test the robustness of the algorithm in a number of
ways and discuss how the network can be trained directly on data.
Comment: 20 pages, 8 figures, 2 tables. Updated to JHEP version
Learning to Classify from Impure Samples with High-Dimensional Data
A persistent challenge in practical classification tasks is that labeled
training sets are not always available. In particle physics, this challenge is
surmounted by the use of simulations. These simulations accurately reproduce
most features of data, but cannot be trusted to capture all of the complex
correlations exploitable by modern machine learning methods. Recent work in
weakly supervised learning has shown that simple, low-dimensional classifiers
can be trained using only the impure mixtures present in data. Here, we
demonstrate that complex, high-dimensional classifiers can also be trained on
impure mixtures using weak supervision techniques, with performance comparable
to what could be achieved with pure samples. Using weak supervision will
therefore allow us to avoid relying exclusively on simulations for
high-dimensional classification. This work opens the door to a new regime
whereby complex models are trained directly on data, providing direct access to
probe the underlying physics.
Comment: 6 pages, 2 tables, 2 figures. v2: updated to match PRD version